Learning compact hashing codes with complex objectives from multiple sources for large scale similarity search

Author: Wang Qifan
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2015

Similarity search is a key problem in many real-world applications, including image and text retrieval, content reuse detection, and collaborative filtering. The purpose of similarity search is to identify similar data examples given a query example. The explosive growth of the Internet has generated a huge amount of data such as texts, images, and videos, making efficient large-scale similarity search increasingly important. Hashing methods have become popular for large-scale similarity search due to their computational and memory efficiency. These hashing methods design compact binary codes to represent data examples so that similar examples are mapped to similar codes. This dissertation addresses five major problems in utilizing supervised information from multiple sources in hashing with respect to different objectives. Firstly, we address the problem of incorporating semantic tags by modeling the latent correlations between tags and data examples. More precisely, the hashing codes are learned in a unified semi-supervised framework by simultaneously preserving the similarities between data examples and ensuring tag consistency via a latent factor model. Secondly, we solve the missing data problem by latent subspace learning from multiple sources. The hashing codes are learned by enforcing data consistency among different sources. Thirdly, we address the problem of hashing on structured data by graph learning. A weighted graph is constructed based on the structured knowledge from the data. The hashing codes are then learned by preserving the graph similarities. Fourthly, we address the problem of learning hashing codes with high ranking quality by utilizing relevance judgments from users. The hashing code/function is learned by optimizing a commonly used non-smooth, non-convex ranking measure, NDCG. Finally, we deal with the problem of insufficient supervision by active learning. We propose to actively select the most informative data examples and tags in a joint manner, based on the selection criterion that both the data examples and tags should be most uncertain and dissimilar to each other. Extensive experiments on several large-scale datasets demonstrate the superior performance of the proposed approaches over several state-of-the-art hashing methods from different perspectives.
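To make the ranking objective mentioned in this abstract concrete, the following is a minimal sketch of how NDCG could be evaluated for a ranking induced by Hamming distance between binary codes. The function names, the integer encoding of the codes, and the exponential gain `2**rel - 1` are assumptions chosen for illustration; the sketch does not reproduce the dissertation's learning method, only the evaluation measure it names.

```python
import math


def hamming(a, b):
    """Hamming distance between two binary codes stored as Python ints."""
    return bin(a ^ b).count("1")


def dcg(relevances):
    """Discounted cumulative gain of graded relevance labels in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0


def ndcg_of_hamming_ranking(query_code, codes, relevances, k=None):
    """Rank database items by Hamming distance to the query, then score with NDCG."""
    order = sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
    return ndcg([relevances[i] for i in order], k)
```

Because the score depends on the codes only through the sorted order, small changes to a code either leave NDCG unchanged or make it jump, which is why the abstract describes the measure as non-smooth.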

Purdue E-Pubs

Impacts of florfenicol on the microbiota landscape and resistome as revealed by metagenomic analysis.

Author: Liao Chao
Terhune Jeffery
Wang Luxin
Zeng Qifan
Publication venue: eScholarship, University of California
Publication date: 01/12/2019

BACKGROUND: Drug-resistant fish pathogens can cause significant economic loss to fish farmers. Since 2012, florfenicol has been an approved drug for treating both septicemia and columnaris diseases in freshwater fish. Due to the limited drug options available for aquaculture, the impact of therapeutic florfenicol treatment on the microbiota landscape, as well as on the resistome present in the aquaculture farm environment, needs to be evaluated.
RESULTS: Time-series metagenomic analyses were conducted on the aquatic microbiota present in tank-based catfish production systems, in which catfish received the standard therapeutic 10-day florfenicol treatment following federal veterinary regulations. Results showed that the florfenicol treatment shifted the structure of the microbiota and reduced its biodiversity by acting as a strong stressor. Planctomycetes, Chloroflexi, and 13 other phyla were susceptible to the florfenicol treatment, and their abundance was inhibited by the treatment. In contrast, the abundance of several bacteria belonging to the Proteobacteria, Bacteroidetes, Actinobacteria, and Verrucomicrobia phyla increased. These bacteria with increased abundance either harbored florfenicol-resistant genes (FRGs) or had beneficial mutations. The florfenicol treatment promoted the proliferation of florfenicol-resistant genes. The copy number of phenicol-specific resistance genes, as well as of multiple classes of antibiotic-resistant genes (ARGs), exhibited strong correlations across different genetic exchange communities (p < 0.05), indicating the horizontal transfer of florfenicol-resistant genes among these bacterial species or genera. Florfenicol treatment also induced mutation-driven resistance. Significant changes in single-nucleotide polymorphism (SNP) allele frequencies were observed in membrane transporters, in genes involved in recombination, and in genes with primary functions of a resistance phenotype.
CONCLUSIONS: The therapeutic level of florfenicol treatment significantly altered the microbiome and resistome present in catfish tanks. Both intra-population and inter-population horizontal ARG transfer were observed, with intra-population transfer being more common. The oxazolidinone/phenicol-resistant gene optrA was the most prevalent transferred ARG. In addition to horizontal gene transfer, bacteria could also acquire florfenicol resistance by regulating the innate efflux systems via mutations. The observations made by this study are of great importance for guiding the strategic use of florfenicol, thus preventing the formation, persistence, and spreading of florfenicol-resistant bacteria and resistance genes in aquaculture.

eScholarship - University of California

Autoregressive Entity Generation for End-to-End Task-Oriented Dialog

Author: Huang Guanhuan
Quan Xiaojun
Wang Qifan
Publication date: 18/09/2022

Task-oriented dialog (TOD) systems often require interaction with an external knowledge base (KB) to retrieve necessary entity (e.g., restaurant) information to support the response generation. Most current end-to-end TOD systems either retrieve the KB information explicitly or embed it into model parameters for implicit access. While the former approach demands scanning the KB at each turn of response generation, which is inefficient when the KB scales up, the latter approach shows higher flexibility and efficiency. In either approach, the systems may generate a response with conflicting entity information. To address this issue, we propose to generate the entity autoregressively first and leverage it to guide the response generation in an end-to-end system. To ensure entity consistency, we impose a trie constraint on entity generation. We also introduce a logit concatenation strategy to facilitate gradient backpropagation for end-to-end training. Experiments on MultiWOZ 2.1 single and CAMREST show that our system can generate more high-quality and entity-consistent responses. Comment: Accepted to COLING 202
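The trie constraint described in this abstract can be illustrated with a small sketch. It assumes entities are pre-tokenized lists of strings and that the decoder exposes some scoring function over candidate next tokens; `EntityTrie`, `constrained_decode`, the `score` callback, and the `<eos>` marker are hypothetical names for illustration, not the paper's implementation.

```python
END = "<eos>"  # hypothetical end-of-entity marker


class EntityTrie:
    """Prefix tree over tokenized entity names."""

    def __init__(self, entities):
        self.root = {}
        for tokens in entities:
            node = self.root
            for tok in tokens:
                node = node.setdefault(tok, {})
            node[END] = {}  # mark that a complete entity ends here

    def allowed_next(self, prefix):
        """Tokens that keep the prefix on a path to some known entity."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node)


def constrained_decode(trie, score, max_len=10):
    """Greedy decoding restricted at every step to tokens the trie allows."""
    prefix = []
    for _ in range(max_len):
        allowed = trie.allowed_next(prefix)
        if not allowed:
            break
        # sorted() makes tie-breaking deterministic
        best = max(sorted(allowed), key=lambda t: score(prefix, t))
        if best == END:
            break
        prefix.append(best)
    return prefix
```

With this restriction, the decoder cannot mix tokens from two different entities, because every emitted prefix is by construction a prefix of at least one entity in the knowledge base.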

arXiv.org e-Print Archive

Disentangled Phonetic Representation for Chinese Spelling Correction

Author: Liang Zihong
Quan Xiaojun
Wang Qifan
Publication date: 24/05/2023

Chinese Spelling Correction (CSC) aims to detect and correct erroneous characters in Chinese texts. Although efforts have been made to introduce phonetic information (Hanyu Pinyin) in this task, they typically merge phonetic representations with character representations, which tends to weaken the representation effect of normal texts. In this work, we propose to disentangle the two types of features to allow for direct interaction between textual and phonetic information. To learn useful phonetic representations, we introduce a pinyin-to-character objective to ask the model to predict the correct characters based solely on phonetic information, where a separation mask is imposed to disable attention from phonetic input to text. To avoid overfitting the phonetics, we further design a self-distillation module to ensure that semantic information plays a major role in the prediction. Extensive experiments on three CSC benchmarks demonstrate the superiority of our method in using phonetic information. Comment: Accepted to ACL 2023 Main Conferenc
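One way to picture the separation mask is as a boolean attention matrix. The sketch below assumes the input is laid out as text tokens followed by phonetic tokens, and reads "disable attention from phonetic input to text" as meaning that phonetic positions (as queries) may not attend to text positions (as keys). Both the layout and the function name `separation_mask` are assumptions for illustration, not details confirmed by the abstract.

```python
def separation_mask(n_text, n_phonetic):
    """Return mask[i][j] == True when query position i may attend to key position j.

    Positions 0..n_text-1 are text tokens; the rest are phonetic tokens
    (an assumed layout). Phonetic queries are blocked from text keys.
    """
    n = n_text + n_phonetic
    return [[not (i >= n_text and j < n_text) for j in range(n)]
            for i in range(n)]
```

Under this reading, text positions still see the full sequence, while the phonetic positions must predict characters from phonetic information alone.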

arXiv.org e-Print Archive

Enhancing Low-Precision Sampling via Stochastic Gradient Hamiltonian Monte Carlo

Author: Chen Yujie
Song Qifan
Wang Ziyi
Zhang Ruqi
Publication date: 24/10/2023

Low-precision training has emerged as a promising low-cost technique to enhance the training efficiency of deep neural networks without sacrificing much accuracy. Its Bayesian counterpart can further provide uncertainty quantification and improved generalization accuracy. This paper investigates low-precision sampling via Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with low-precision and full-precision gradient accumulators for both strongly log-concave and non-log-concave distributions. Theoretically, our results show that, to achieve $\epsilon$-error in the 2-Wasserstein distance for non-log-concave distributions, low-precision SGHMC achieves quadratic improvement ($\widetilde{\mathbf{O}}\left({\epsilon^{-2}{\mu^*}^{-2}\log^2\left({\epsilon^{-1}}\right)}\right)$) compared to the state-of-the-art low-precision sampler, Stochastic Gradient Langevin Dynamics (SGLD) ($\widetilde{\mathbf{O}}\left({{\epsilon}^{-4}{\lambda^{*}}^{-1}\log^5\left({\epsilon^{-1}}\right)}\right)$). Moreover, we prove that low-precision SGHMC is more robust to the quantization error compared to low-precision SGLD due to the robustness of the momentum-based update w.r.t. gradient noise. Empirically, we conduct experiments on synthetic data and on the MNIST, CIFAR-10, and CIFAR-100 datasets, which validate our theoretical findings. Our study highlights the potential of low-precision SGHMC as an efficient and accurate sampling method for large-scale and resource-limited machine learning.
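As a rough illustration of the idea, the sketch below combines a plain Euler discretization of the SGHMC dynamics with stochastic rounding onto a fixed-point grid. It is a one-dimensional toy under stated assumptions: the grid spacing `delta`, the choice to quantize both position and momentum after every step, and all function names are illustrative, and the code does not reproduce the paper's specific accumulator variants or its analysis.

```python
import math
import random


def stochastic_round(x, delta, rng):
    """Round x to a multiple of delta, rounding up with probability
    proportional to the distance from the lower grid point (unbiased in expectation)."""
    lower = math.floor(x / delta) * delta
    p_up = (x - lower) / delta
    return lower + delta if rng.random() < p_up else lower


def sghmc_step(theta, v, grad, step, friction, delta, rng):
    """One SGHMC update followed by quantization of both state variables.

    grad is a (possibly noisy) gradient of the negative log-density.
    """
    noise = math.sqrt(2.0 * friction * step) * rng.gauss(0.0, 1.0)
    v = (1.0 - friction * step) * v - step * grad(theta) + noise
    theta = theta + step * v
    return stochastic_round(theta, delta, rng), stochastic_round(v, delta, rng)


def run_chain(grad, n_steps, step=0.05, friction=1.0, delta=2 ** -6, seed=0):
    """Run a quantized SGHMC chain from the origin and return the visited positions."""
    rng = random.Random(seed)
    theta, v = 0.0, 0.0
    samples = []
    for _ in range(n_steps):
        theta, v = sghmc_step(theta, v, grad, step, friction, delta, rng)
        samples.append(theta)
    return samples
```

For example, `run_chain(lambda t: t, 1000)` targets a standard normal density, whose negative log-density has gradient `t`; every stored sample then lies on the grid of spacing `delta`.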

arXiv.org e-Print Archive

Federated Generalization via Information-Theoretic Distribution Diversification

Author: Wang Qifan
Wu Zheshun
Xu Zenglin
Zeng Dun
Publication date: 12/10/2023

Federated Learning (FL) has surged in prominence due to its capability of collaborative model training without direct data sharing. However, the vast disparity in local data distributions among clients, often termed the non-Independent Identically Distributed (non-IID) challenge, poses a significant hurdle to FL's generalization efficacy. The scenario becomes even more complex when not all clients participate in the training process, a common occurrence due to unstable network connections or limited computational capacities. This can greatly complicate the assessment of the trained models' generalization abilities. While a plethora of recent studies has centered on the generalization gap pertaining to unseen data from participating clients with diverse distributions, the divergence between the training distributions of participating clients and the testing distributions of non-participating ones has been largely overlooked. In response, our paper unveils an information-theoretic generalization framework for FL. Specifically, it quantifies generalization errors by evaluating the information entropy of local distributions and discerning discrepancies across these distributions. Inspired by our deduced generalization bounds, we introduce a weighted aggregation approach and two client selection strategies. These innovations aim to bolster FL's generalization by encompassing a more varied set of client data distributions. Our extensive empirical evaluations reaffirm the effectiveness of our proposed methods and align with our theoretical framework.

arXiv.org e-Print Archive

Attack Prompt Generation for Red Teaming and Defending Large Language Models

Author: Deng Boyi
Deng Yang
Feng Fuli
He Xiangnan
Wang Qifan
Wang Wenjie
Publication date: 19/10/2023

Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on construction cost and quality. To address these issues, we propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts. Specifically, considering the impressive capabilities of newly emerged LLMs, we propose an attack framework to instruct LLMs to mimic human-generated prompts through in-context learning. Furthermore, we propose a defense framework that fine-tunes victim LLMs through iterative interactions with the attack framework to enhance their safety against red teaming attacks. Extensive experiments on different LLMs validate the effectiveness of our proposed attack and defense frameworks. Additionally, we release a series of attack prompt datasets named SAP with varying sizes, facilitating the safety evaluation and enhancement of more LLMs. Our code and dataset are available at https://github.com/Aatrox103/SAP. Comment: Accepted to EMNLP 2023 (Findings

arXiv.org e-Print Archive

Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation

Author: Feng Fuli
He Xiangnan
Hu Xunhan
Wang Chenxu
Wang Qifan
Zhang Yang
Publication date: 27/07/2023

Historical interactions are the default choice for recommender model training, and they typically exhibit high sparsity, i.e., most user-item pairs are unobserved missing data. A standard choice is to treat the missing data as negative training samples and to estimate the interaction likelihood between user-item pairs along with the observed interactions. In this way, some potential interactions are inevitably mislabeled during training, which hurts model fidelity and hinders the model from recalling the mislabeled items, especially the long-tail ones. In this work, we investigate the mislabeling issue from a new perspective of aleatoric uncertainty, which describes the inherent randomness of missing data. The randomness pushes us to go beyond merely the interaction likelihood and embrace aleatoric uncertainty modeling. Towards this end, we propose a new Aleatoric Uncertainty-aware Recommendation (AUR) framework that consists of a new uncertainty estimator along with a normal recommender model. According to the theory of aleatoric uncertainty, we derive a new recommendation objective to learn the estimator. As the chance of mislabeling reflects the potential of a pair, AUR makes recommendations according to the uncertainty, which is demonstrated to improve the recommendation performance of less popular items without sacrificing the overall performance. We instantiate AUR on three representative recommender models from mainstream model architectures: Matrix Factorization (MF), LightGCN, and VAE. Extensive results on two real-world datasets validate the effectiveness of AUR w.r.t. better recommendation results, especially on long-tail items.

arXiv.org e-Print Archive

A New Technique for Multispectral and Panchromatic Image Fusion

Author: Hu Yingjie
Jia Zhenhong
Qin Xizhong
Wang Qifan
Yang Jie
Publication venue: Published by Elsevier Ltd.
Publication date: 31/12/2011

In this paper, a technique is presented for fusing panchromatic (PAN) and low-spatial-resolution multispectral (MS) images to increase the spatial resolution of the latter. We first apply a PCA transformation to the MS image to obtain the principal component (PC) images. An NSCT transformation is then applied to the PAN image and to each PC image for N levels of decomposition. We use FOCC as the criterion to select a PC, and relative entropy as the criterion to reconstruct the high-frequency detail images. Finally, we apply the inverse NSCT to the selected PC's low-frequency approximation image and the reconstructed high-frequency detail images to obtain a high-spatial-resolution MS image. Experimental results indicate that the proposed image fusion method yields some improvement in fusion performance.

Elsevier - Publisher Connector